
chore(buffers): Rework counter/gauge increment handling#23542

Closed
bruceg wants to merge 16 commits into master from bruceg/rework-buffer-counters

Conversation

@bruceg
Member

@bruceg bruceg commented Aug 6, 2025

Summary

This is a step-by-step refactoring of the changes made in #23453 and #23507 in order to remove the need for the shared BUFFER_COUNTERS dashmap. While the whole change is rather large, I attempted to keep each commit from altering externally-visible behavior, with the exception of additional labels in the buffer create gauges. As such, reviewing each individual commit is likely more useful than reviewing the whole code change.

However, this leaves a bit of a conundrum. If we look at the changes to lib/vector-buffers/src starting immediately before the above changes (you have to scroll down manually to the buffer source files; the direct GitHub link doesn't quite work), the result doesn't entirely make sense:

  1. The metrics have the buffer_id added to disambiguate metrics with the same ID but different stages. I thought, however, that the buffers would have been created in the context of a component and, as such, would already have the component ID in their tags. If not, this is a real bug, maybe even the real bug. Since the buffer_id corresponds to a component ID, maybe that particular tag name is less than ideal.
  2. The counters are updated using a saturating add, which prevents them from overflowing, which is good, but does nothing to deal with the negatives that were the heart of Vector buffer metrics report negative values #21702, the issue that prompted the above two PRs.
  3. The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but it is a byproduct of all the refactoring.
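The interplay of points 2 and 3 can be sketched in isolation. This is a minimal sketch in plain Rust with illustrative function names, not Vector's actual API: saturating addition guards the unsigned counters against overflow, while clamping the signed difference is what keeps the gauges from going negative.

```rust
// Hypothetical sketch of the two update styles discussed above; the
// function names are illustrative, not Vector's actual code.

fn counter_update(current: u64, delta: u64) -> u64 {
    // Saturating add prevents overflow wrap-around; since both operands
    // are unsigned, the result can never be negative in the first place.
    current.saturating_add(delta)
}

fn gauge_set(received: u64, sent: u64) -> f64 {
    // Clamping the signed difference at zero is what guards against
    // negative gauge values, e.g. when sends are observed before receives.
    (received as i64 - sent as i64).clamp(0, i64::MAX) as f64
}

fn main() {
    assert_eq!(counter_update(u64::MAX, 10), u64::MAX); // saturates at the max
    assert_eq!(gauge_set(2000, 100), 1900.0);
    assert_eq!(gauge_set(100, 2000), 0.0); // clamped, not negative
}
```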

Vector configuration

N/A

How did you test this PR?

Tested using the unit tests; all of the above should be behavior-preserving.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook; please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

bruceg added 15 commits August 6, 2025 12:14
The counters that come in from the internal events can never contain negative
values, so handling them is completely unnecessary.
Note that the tests on the state of the gauges have been dropped as they now
would only test the proper operation of the metrics recorder.
This ends up dropping all the new tests, as they only exercised the
`BUFFER_COUNTERS` map, which could be removed with this change.
The values in the `total_*` counters are manipulated identically to the other
fields, so we can drop them now.
@bruceg bruceg requested a review from pront August 6, 2025 20:03
@bruceg bruceg requested a review from a team as a code owner August 6, 2025 20:03
@bruceg bruceg added type: tech debt A code change that does not add user value. domain: buffers Anything related to Vector's memory/disk buffers complexity: moderate Moderate complex issue, that takes a fair bit of work to solve. labels Aug 6, 2025
pront
pront previously approved these changes Aug 6, 2025
Member

@pront pront left a comment


The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but is an outflow of all the refactoring.

This is concerning, but it is also an existing issue.

@bruceg bruceg added the no-changelog Changes in this PR do not need user-facing explanations in the release changelog label Aug 6, 2025
@vparfonov
Contributor

2. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg
Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20
value will be in range: [0, i64::MAX]

@vparfonov
Contributor

The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but is an outflow of all the refactoring.

This is concerning but also it an existing issue.

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

@bruceg
Member Author

bruceg commented Aug 6, 2025

  1. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20 value will be in range: [0, i64::MAX]

Are you saying that we are seeing values that are wrapping around an i64? That is, byte size totals larger than 9 exabytes (9e18)? I'm not clear on the point you are making here.

@bruceg
Member Author

bruceg commented Aug 6, 2025

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

If I traced the logic right, the upstream values for those increments are always positive as they are communicated in unsigned integers. Did I mix up a subtraction?

@vparfonov
Contributor

  1. The counters are updated using a saturating add which prevents them from overflowing, which is good, but does nothing to deal with negatives which were the heart of Vector buffer metrics report negative values #21702 that prompted the above two PRs.

@bruceg Take a look please here: https://github.com/vectordotdev/vector/pull/23453/files#diff-cd5d7e9d7dc7fc3347ca77c5322795f9115cfc8d0a71375d91de45d9d810ff51R20 value will be in range: [0, i64::MAX]

Are you saying that we are seeing values that are wrapping around an i64? That is, byte size totals larger than 9 exabytes (9e18)? I'm not clear on the point you are making here.

You mentioned this doesn't deal with negatives, but I pointed to clamp(0, i64::MAX). This part of the code gives us the assurance that the result will not be negative.

@vparfonov
Contributor

The main reason to use set() instead of increment() is to prevent the value from becoming negative. The set() method allows us to validate the number first, ensuring it's always a correct and safe value.

If I traced the logic right, the upstream values for those increments are always positive as they are communicated in unsigned integers. Did I mix up a subtraction?

Maybe my previous note was a bit unclear. The change from increment() to set() isn't about a single operation going negative. It's about making the metric more robust.

The old increment-based approach is fragile because it assumes we'll never miss an event and that events arrive in the correct order. With set(), we're reporting the actual, current state. This makes the metric self-healing and immune to missed events or race conditions.
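The set()-vs-increment() trade-off described above can be sketched with a toy model (plain Rust, illustrative names only; not Vector's code): a single lost event corrupts an increment-based gauge forever, while a set-based gauge recovers on the next report.

```rust
// Illustrative sketch: increment-style gauges drift after a lost event,
// set-style gauges are defined by the most recent report alone.

/// Apply per-report deltas to a gauge, optionally "losing" one event.
fn incremental_gauge(sizes: &[u64], missed: Option<usize>) -> i64 {
    let mut gauge = 0i64;
    let mut prev = 0i64;
    for (i, &size) in sizes.iter().enumerate() {
        let delta = size as i64 - prev;
        prev = size as i64;
        if missed != Some(i) {
            gauge += delta; // this event's delta reaches the gauge
        }
    }
    gauge
}

/// set()-style reporting: the last report alone defines the gauge.
fn set_gauge(sizes: &[u64]) -> i64 {
    *sizes.last().unwrap() as i64
}

fn main() {
    let true_sizes = [100u64, 250, 180, 300];
    // With no lost events the two styles agree.
    assert_eq!(incremental_gauge(&true_sizes, None), 300);
    // Losing one delta corrupts the incremental gauge permanently...
    assert_eq!(incremental_gauge(&true_sizes, Some(1)), 150);
    // ...while set()-style reporting still shows the true state.
    assert_eq!(set_gauge(&true_sizes), 300);
}
```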

@vparfonov
Contributor

I think I found a small but critical issue. The gauge is not tracking the buffer's total size; it's only tracking the size of the most recent batch of events.
I added a new unit test that demonstrates this behavior.

For reference, here is the approach from my PR: https://github.com/vectordotdev/vector/pull/23518/files#diff-69337e368c06496cca89f39bdf229fb2ebe3c0413c40f3fc0cd9d19d4fdea177R53

/cc @bruceg @pront

    #[test]
    fn test_event_and_byte_size_gauges_are_overwritten() {
        use metrics::{Key, Label};
        use metrics_util::debugging::{DebugValue, DebuggingRecorder};
        use metrics_util::{CompositeKey, MetricKind};
        use ordered_float::OrderedFloat;

        let recorder = DebuggingRecorder::default();
        let snapshotter = recorder.snapshotter();
        let buffer_id = "test_buffer_overwrite";
        let stage_idx = 0;

        metrics::with_local_recorder(&recorder, || {
            emit(BufferEventsReceived {
                buffer_id: buffer_id.to_string(),
                idx: stage_idx,
                count: 100,
                byte_size: 2000,
            });
            emit(BufferEventsSent {
                buffer_id: buffer_id.to_string(),
                idx: stage_idx,
                count: 5,
                byte_size: 100,
            });
        });

        let expected_events_count = 100.0 - 5.0; //95.0
        let expected_byte_size = 2000.0 - 100.0; //1900.0

        let metrics = snapshotter.snapshot().into_vec();

        let events_key = CompositeKey::new(
            MetricKind::Gauge,
            Key::from_parts(
                "buffer_events",
                vec![
                    Label::new("buffer_id", buffer_id.to_string()),
                    Label::new("stage", stage_idx.to_string()),
                ],
            ),
        );
        let final_events_gauge = metrics
            .iter()
            .find(|(key, _, _, _)| key == &events_key)
            .map(|(_, _, _, value)| value)
            .unwrap();

        assert_eq!(
            *final_events_gauge,
            DebugValue::Gauge(OrderedFloat(expected_events_count))
        );

        let bytes_key = CompositeKey::new(
            MetricKind::Gauge,
            Key::from_parts(
                "buffer_byte_size",
                vec![
                    Label::new("buffer_id", buffer_id.to_string()),
                    Label::new("stage", stage_idx.to_string()),
                ],
            ),
        );
        let final_bytes_gauge = metrics
            .iter()
            .find(|(key, _, _, _)| key == &bytes_key)
            .map(|(_, _, _, value)| value)
            .unwrap();

        assert_eq!(
            *final_bytes_gauge,
            DebugValue::Gauge(OrderedFloat(expected_byte_size))
        );
    }

@pront
Member

pront commented Aug 7, 2025

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time-sensitive question: do any of the findings here indicate that we have a gap on origin/master vs. the previous Vector release? I tried going over everything and I think the answer is "no, we can go ahead with the current state of origin/master," but I wanted to double-check with you folks. Thank you.

@vparfonov
Contributor

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time-sensitive question: do any of the findings here indicate that we have a gap on origin/master vs. the previous Vector release? I tried going over everything and I think the answer is "no, we can go ahead with the current state of origin/master," but I wanted to double-check with you folks. Thank you.

I believe we can go ahead, but @bruceg's voice will be more valuable.

@bruceg
Member Author

bruceg commented Aug 7, 2025

You mentioned this doesn't deal with negatives, but I pointed to clamp(0, i64::MAX). This part of the code gives us the assurance that the result will not be negative.

Granted, yes. However, the values being added are always converted from unsigned and so start as positive, so the saturating_add should already result in a positive value.

Maybe my previous note was a bit unclear. The change from increment() to set() isn't about a single operation going negative. It's about making the metric more robust.

The old increment-based approach is fragile because it assumes we'll never miss an event and that events arrive in the correct order. With set(), we're reporting the actual, current state. This makes the metric self-healing and immune to missed events or race conditions.

Agree 100% here. My concern is that it appears, when all the refactoring is done, that they are set from the same value that had previously been used to increment them. That part doesn't make sense to me.

I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time sensitive question. Are any of the findings here indicate that we have a gap on origin/master vs the previous Vector release?

I think, given the above, a regression has been merged, albeit one that is not tested. It would be great to set up some test cases that assert the correct behavior, though, to be able to say for sure. The test above is a good start, but unfortunately it does not validate that the buffer mechanism itself adjusts the gauges to the correct values.

@pront
Member

pront commented Aug 7, 2025

FYI: To mitigate risk, we’ll revert the merged changes in #23556.
This will allow us to address the buffer metrics issues properly without introducing regressions.

@vparfonov
Contributor

vparfonov commented Aug 7, 2025

I think, given the above, a regression has been merged, albeit one that is not tested. It would be great to set up some test cases that assert the correct behavior, though, to be able to say for sure. The test above is a good start, but unfortunately it does not validate that the buffer mechanism itself adjusts the gauges to the correct values.

What regression do you mean? The solution that was merged into master was a necessary fix for a critical bug where the metrics were incorrect. It is fully thread-safe, and its logic is confirmed by the unit tests that were added in that same PR.

@pront pront self-requested a review August 7, 2025 20:37
@pront pront dismissed their stale review August 7, 2025 20:37

need to take a second look

@bruceg
Member Author

bruceg commented Aug 7, 2025

What regression do you mean? The solution that was merged into master was a necessary fix for a critical bug where the metrics were incorrect. It is fully thread-safe, and its logic is confirmed by the unit tests that were added in that same PR.

Do you have a configuration or other test case that can reproduce the original problem of negative metrics reported? I understand that may be difficult if it is a race that produces them, but within the buffers crate we can control the two halves to make that less of an issue.
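One way such a reproduction could be modeled (a hypothetical sketch, not Vector's actual buffer code): if the "sent" decrement is observed before the matching "received" increment, any snapshot taken between the two reports a negative gauge, even though the final value comes out correct.

```rust
// Hypothetical model of the race behind #21702: an out-of-order "sent"
// update makes intermediate gauge snapshots go negative. Names are
// illustrative, not Vector's code.

/// Running snapshots of a gauge after each signed update.
fn snapshots(updates: &[i64]) -> Vec<i64> {
    let mut gauge = 0i64;
    updates
        .iter()
        .map(|&delta| {
            gauge += delta;
            gauge // record the value an observer would see at this point
        })
        .collect()
}

fn main() {
    // Intended order: +100 received, then -100 sent.
    assert_eq!(snapshots(&[100, -100]), vec![100, 0]);
    // Raced order: the decrement lands first, so the intermediate
    // snapshot is negative even though the final value is correct.
    assert_eq!(snapshots(&[-100, 100]), vec![-100, 0]);
}
```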

@bruceg bruceg marked this pull request as draft August 8, 2025 00:33
@bruceg
Member Author

bruceg commented Aug 8, 2025

After re-examining all three patch sets, I noticed that I made a mistake in this step by dropping several negations. That skewed all the following refactoring steps and produced a broken result, which explains my confusion at the start. I will have to rework this to resolve that.

I still believe there is a much simpler solution hiding in the resulting code, but I can see that the current code is more correct in two aspects than the original gauges:

  1. clamping off negatives and over-large values
  2. adding the buffer ID to the metrics to disambiguate
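Point 2 can be illustrated with a toy model (a plain HashMap standing in for the metrics registry; names are illustrative, not Vector's code): keying gauges by (buffer_id, stage) keeps multi-stage buffers from overwriting each other's values.

```rust
use std::collections::HashMap;

// Minimal sketch of label disambiguation: without the stage in the key,
// two stages of the same buffer would collide on one gauge entry.

type GaugeKey = (String, usize);

fn record(gauges: &mut HashMap<GaugeKey, f64>, id: &str, stage: usize, v: f64) {
    // set()-style: each report overwrites the gauge for that exact key.
    gauges.insert((id.to_string(), stage), v);
}

fn main() {
    let mut gauges = HashMap::new();
    record(&mut gauges, "my_sink", 0, 2000.0);
    record(&mut gauges, "my_sink", 1, 512.0); // same buffer, later stage
    assert_eq!(gauges.len(), 2); // no collision between stages
    assert_eq!(gauges[&("my_sink".to_string(), 0)], 2000.0);
}
```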

@bruceg
Member Author

bruceg commented Aug 8, 2025

I will open a new PR when I have finished fixing the refactoring. I have now discovered two separate issues, the second of which is harder to compensate for.

